Motor Vehicle Collisions - Crashes Analysis

Introduction to Dataset

The Motor Vehicle Collisions crash table contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage (https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf). It should be noted that the data is preliminary and subject to change when the MV-104AN forms are amended based on revised crash details.For the most accurate, up to date statistics on traffic fatalities, please refer to the NYPD Motor Vehicle Collisions page (updated weekly) or Vision Zero View (updated monthly).

Today's Outline

6 Step Machine Learning Modelling Framework

More specifically, we'll look at the following topics.

  • Exploratory data analysis (EDA) - the process of going through a dataset and finding out more about it.
  • Feature Analysis - since we're predicting the presence of heart disease, are there some things which are more important for prediction?
  • Reporting what we've found - if we had to present our work, what would we show someone?

To work through these topics, we'll use pandas, Matplotlib and NumPy for data anaylsis, as well as, Scikit-Learn for machine learning and modelling tasks.

Tools which can be used for each step of the machine learning modelling process.

1. Problem Definition

There really is no particular problem we're trying to solve. We're just analyzing data and looking for insights that we can use to draw conclusions.

There really is no particular problem we're trying to solve. We're just analyzing data and looking for insights that we can use to draw conclusions.

Questions

  1. Total Number of People Killed per Borough?
  2. Number of People Injured VS Killed per Month?
  3. Number of Accidents per Month?
  4. Which Borough has the highest number of deaths by collisions?
  5. What Are The Top 5 Causes of Collisions In Brooklyn, Queens & Manhattan?
  6. What is the weekly collisions spread over every hour?
  7. What is the collisions frequency every hour of the week?
  8. How do collisions that cause injuries vary by street ?

Import Necessary Tools

In [139]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import chart_studio.plotly as py
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
warnings.filterwarnings('ignore')
import time
import networkx as nx
from plotly.graph_objs import *

2. Data

What you'll want to do here is dive into the data your problem definition is based on. This may involve, sourcing, defining different parameters, talking to experts about it and finding out what you should expect.

We'll look for Attributes (also called features) and variables that we'll potentially use to predict our target variable.

Attributes and features are also referred to as independent variables and a target variable can be referred to as a dependent variable.

Load Data

In [160]:
df = pd.read_csv('Datasets/Motor_Vehicle_Collisions_-_Crashes.csv')

collisionsDF = df.copy()
df_copy = df.copy()
df_copy3 = df.copy()
all_dataC = df.copy()
df.head().T
Out[160]:
0 1 2 3 4
CRASH DATE 08/18/2020 09/25/2020 09/05/2020 09/30/2020 08/19/2020
CRASH TIME 12:00 0:01 19:25 14:00 12:36
BOROUGH BROOKLYN NaN STATEN ISLAND BROOKLYN BRONX
ZIP CODE 11214 NaN 10305 11235 10472
LATITUDE 40.601 NaN 40.6044 40.5864 40.8307
LONGITUDE -73.9935 NaN -74.069 -73.9479 -73.8671
LOCATION (40.60101, -73.99352) NaN (40.604397, -74.069) (40.586376, -73.9479) (40.830738, -73.86707)
ON STREET NAME NaN TRIBOROUGH BRIDGE FINGERBOARD ROAD VOORHIES AVENUE NaN
CROSS STREET NAME NaN NaN NARROWS ROAD NORTH EAST 21 STREET NaN
OFF STREET NAME 15 BAY 31 STREET NaN NaN NaN 1231 SAINT LAWRENCE AVENUE
NUMBER OF PERSONS INJURED 0 0 0 2 0
NUMBER OF PERSONS KILLED 0 0 0 0 0
NUMBER OF PEDESTRIANS INJURED 0 0 0 0 0
NUMBER OF PEDESTRIANS KILLED 0 0 0 0 0
NUMBER OF CYCLIST INJURED 0 0 0 0 0
NUMBER OF CYCLIST KILLED 0 0 0 0 0
NUMBER OF MOTORIST INJURED 0 0 0 2 0
NUMBER OF MOTORIST KILLED 0 0 0 0 0
CONTRIBUTING FACTOR VEHICLE 1 Backing Unsafely Other Vehicular Driver Inattention/Distraction Driver Inattention/Distraction Unspecified
CONTRIBUTING FACTOR VEHICLE 2 Unspecified NaN Unspecified Unspecified Unspecified
CONTRIBUTING FACTOR VEHICLE 3 NaN NaN Unspecified NaN NaN
CONTRIBUTING FACTOR VEHICLE 4 NaN NaN NaN NaN NaN
CONTRIBUTING FACTOR VEHICLE 5 NaN NaN NaN NaN NaN
COLLISION_ID 4342387 4351600 4346201 4353455 4340085
VEHICLE TYPE CODE 1 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle Sedan Pick-up Truck Station Wagon/Sport Utility Vehicle
VEHICLE TYPE CODE 2 Station Wagon/Sport Utility Vehicle NaN Sedan Motorscooter Station Wagon/Sport Utility Vehicle
VEHICLE TYPE CODE 3 NaN NaN Sedan NaN NaN
VEHICLE TYPE CODE 4 NaN NaN NaN NaN NaN
VEHICLE TYPE CODE 5 NaN NaN NaN NaN NaN

EDA (Exploratory Data Analysis)

In [143]:
df.shape
df.describe()
df.info()
Out[143]:
(1734272, 29)
In [48]:
df.columns
Out[48]:
Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
       'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
       'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
       'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
       'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
       'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
       'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
      dtype='object')
In [49]:
#Number of accidents per Borough
df.groupby('BOROUGH').size()
Out[49]:
BOROUGH
BRONX            171786
BROOKLYN         376159
MANHATTAN        281857
QUEENS           321879
STATEN ISLAND     51026
dtype: int64
In [51]:
df['ZIP CODE'].isna().sum()
Out[51]:
531775
In [142]:
# check missing values
df.isnull().any()
Out[142]:
CRASH DATE                       False
CRASH TIME                       False
BOROUGH                           True
ZIP CODE                          True
LATITUDE                          True
LONGITUDE                         True
LOCATION                          True
ON STREET NAME                    True
CROSS STREET NAME                 True
OFF STREET NAME                   True
NUMBER OF PERSONS INJURED         True
NUMBER OF PERSONS KILLED          True
NUMBER OF PEDESTRIANS INJURED    False
NUMBER OF PEDESTRIANS KILLED     False
NUMBER OF CYCLIST INJURED        False
NUMBER OF CYCLIST KILLED         False
NUMBER OF MOTORIST INJURED       False
NUMBER OF MOTORIST KILLED        False
CONTRIBUTING FACTOR VEHICLE 1     True
CONTRIBUTING FACTOR VEHICLE 2     True
CONTRIBUTING FACTOR VEHICLE 3     True
CONTRIBUTING FACTOR VEHICLE 4     True
CONTRIBUTING FACTOR VEHICLE 5     True
COLLISION_ID                     False
VEHICLE TYPE CODE 1               True
VEHICLE TYPE CODE 2               True
VEHICLE TYPE CODE 3               True
VEHICLE TYPE CODE 4               True
VEHICLE TYPE CODE 5               True
dtype: bool

Back To Questions That Sparked My Interests

  1. Total Number of People Killed per Borough
  2. Number of People Injured / Killed Per Month?
  3. Number of People Killed Per Borough
  4. Number of People Killed Per Month
  5. What Are the Number of Accidents Per Day
  6. What is the Hourly Rate of Deaths
  7. How is the weekly collisions spread over every hour?
  8. What is the collisions frequency every hour of the week?
  9. How do collisions that cause injuries vary by street ?

2.1 What is the total number of people killed by borough?

In [52]:
print(df[["BOROUGH","NUMBER OF PERSONS INJURED"]].groupby("BOROUGH").count())
               NUMBER OF PERSONS INJURED
BOROUGH                                 
BRONX                             171785
BROOKLYN                          376155
MANHATTAN                         281854
QUEENS                            321876
STATEN ISLAND                      51026
In [55]:
#NUMBER OF PERSONS Killed per BOROUGH
print(df[["BOROUGH","NUMBER OF PERSONS KILLED"]].groupby("BOROUGH").count())
print(df[["BOROUGH","NUMBER OF PEDESTRIANS KILLED"]].groupby("BOROUGH").count())
               NUMBER OF PERSONS KILLED
BOROUGH                                
BRONX                            171779
BROOKLYN                         376154
MANHATTAN                        281851
QUEENS                           321875
STATEN ISLAND                     51025
               NUMBER OF PEDESTRIANS KILLED
BOROUGH                                    
BRONX                                171786
BROOKLYN                             376159
MANHATTAN                            281857
QUEENS                               321879
STATEN ISLAND                         51026
In [56]:
#getting the counts of people killed by borough then rearranging the columns
persk_data=df[["BOROUGH","NUMBER OF PERSONS KILLED"]].groupby("BOROUGH").count()
persk_data['BOROUGH'] = persk_data.index
persk_data=persk_data.reset_index(drop=True)
pedk_data=df[["BOROUGH","NUMBER OF PEDESTRIANS KILLED"]].groupby("BOROUGH").count()
pedk_data['BOROUGH'] = pedk_data.index
pedk_data=pedk_data.reset_index(drop=True)
motk_data=df[["BOROUGH","NUMBER OF MOTORIST KILLED"]].groupby("BOROUGH").count()
motk_data['BOROUGH'] = motk_data.index
motk_data=motk_data.reset_index(drop=True)
In [141]:
s1 = persk_data.set_index('BOROUGH')['NUMBER OF PERSONS KILLED']
s2 = pedk_data.set_index('BOROUGH')['NUMBER OF PEDESTRIANS KILLED']
s3 = motk_data.set_index('BOROUGH')['NUMBER OF MOTORIST KILLED']
df1 = (s1+s2+s3).reset_index(name='Total People Killed')
print (df1)
         BOROUGH  Total People Killed
0          BRONX               515351
1       BROOKLYN              1128472
2      MANHATTAN               845565
3         QUEENS               965633
4  STATEN ISLAND               153077
In [144]:
import matplotlib.pyplot as plt

plt.rcdefaults()
fig, ax = plt.subplots()
y_pos= np.arange(df1.shape[0])
y=df1["Total People Killed"].values
names=df1["BOROUGH"].values
ax.barh(y_pos, y, align='center',color='green', ecolor='black')
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('People Killed in Accident')
ax.set_title('Number of People Killed Per Borough')

for i, v in enumerate(y):
    ax.text(v + 3, i + .25, str(v), color='Green', fontweight='bold')
plt.show()

2.2 What is the number of people injured vs killed per month?

In [22]:
df_copy['DATE'] = pd.to_datetime(df_copy['CRASH DATE'], format="%m/%d/%Y")
day_data = df_copy\
                .filter(items=['DATE','NUMBER OF PERSONS INJURED','NUMBER OF PERSONS KILLED'])\
                .groupby([df_copy.DATE.dt.year, df_copy.DATE.dt.month, df_copy.DATE.dt.day])\
                .sum()
day_data['date'] = day_data.index
day_data['date'] = pd.to_datetime(day_data['date'], format="(%Y, %m, %d)")
day_data = day_data.reset_index(drop=True)
In [23]:
day_data=day_data.tail(100)
In [24]:
# # import plotly
# import plotly.plotly as py
# import plotly.graph_objs as go

# # these two lines are what allow your code to show up in a notebook!
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
# making a chart
trace1 = go.Scatter(
    x=day_data.date, # assign x as the dataframe column 'x'
    y=day_data['NUMBER OF PERSONS INJURED'],
    name='INJURED'
)
trace2 = go.Scatter(
    x=day_data.date,
    y=day_data['NUMBER OF PERSONS KILLED'],
    name='KILLED'
)

plotted_data = [trace1, trace2]
layout = dict(title="Number of persons injured/killed per month",
              xaxis=dict(title='Date', ticklen=10, zeroline=False))
fig = dict(data=plotted_data, layout=layout)
iplot(fig)

2.3 Number of Accidents per Month

In [32]:
#delete empty rows
df=df.dropna(how='all')

#reset index to remove missing indexes
df=df.reset_index(drop=True)


#split date from "DATE" column
df['Date']=df['CRASH DATE'].apply(lambda x: x[:10])
In [33]:
# parse dates
for i in range(len(df['Date'])):
    if(df['Date'][i].startswith('00')):
        p=df['Date'][i-1].split('-')
        current=df['Date'][i].split('-')
        current[0]=p[0]
        df['Date'][i]="-".join(current)
df['Date'] = pd.to_datetime(df['Date'], format = "%m/%d/%Y")
df['Date'][:5]
Out[33]:
0   2020-08-18
1   2020-09-25
2   2020-09-05
3   2020-09-30
4   2020-08-19
Name: Date, dtype: datetime64[ns]
In [34]:
# count of accidents per month
acc_per_month = df['Date'].groupby([df.Date.dt.year, df.Date.dt.month]).agg('count') 

# convert to dataframe
acc_per_month = acc_per_month.to_frame()

# move date month from index to column
acc_per_month['date'] = acc_per_month.index

# rename column
acc_per_month = acc_per_month.rename(columns={acc_per_month.columns[0]:"acc"})

# re-parse dates
acc_per_month['date'] = pd.to_datetime(acc_per_month['date'], format="(%Y, %m)")

# remove index
acc_per_month = acc_per_month.reset_index(drop=True)

# get month of meet
acc_per_month['month'] = acc_per_month.date.dt.month
In [154]:
# these two lines are what allow code to show up in a notebook!
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

# date on the x axis and accidents on the y axis
dataPlot2 = [go.Scatter(x=acc_per_month.date, y=acc_per_month.acc)]

# the layout of figure
layout = dict(title = "Number of Accidents per Month",
              xaxis= dict(title= 'Accidents',ticklen= 2,zeroline= False))

# create and show figure
fig = dict(data = dataPlot2, layout = layout)
iplot(fig)

In [153]:
# def getAccidentsAndPlot(borough):
#     dfs = pd.read_csv('Datasets/Motor_Vehicle_Collisions_-_Crashes.csv')
#     acc_per_day = pd.DataFrame()
#     #Number of accidents per day per in Brooklyn
#     dfs = dfs[dfs["BOROUGH"]==borough]
#     # count of accidents per day in Brooklyn
#     acc_per_day = dfs['CRASH DATE'].groupby([dfs['CRASH DATE'].dt.year, dfs['CRASH DATE'].dt.month,dfs['CRASH DATE'].dt.day]).agg('count') 

#     # convert to dataframe
#     acc_per_day = acc_per_day.to_frame()

#     # move date day from index to column
#     acc_per_day['date'] = acc_per_day.index

#     # rename column
#     acc_per_day = acc_per_day.rename(columns={acc_per_day.columns[0]:"acc"})

#     # re-parse dates
#     acc_per_day['date'] = pd.to_datetime(acc_per_day['date'], format="(%Y, %m, %d)")

#     # remove index
#     acc_per_day = acc_per_day.reset_index(drop=True)

#     # get day of meet
#     acc_per_day['day'] = acc_per_day.date.dt.day

#     # date on the x axis and accidents on the y axis
#     dataPlot3 = [go.Scatter(x=acc_per_month.date, y=acc_per_month.acc)]

#     # the layout of figure
#     layout = dict(title = "Number of Accidents per Day",
#                   xaxis= dict(title= borough,ticklen= 2,zeroline= False))

#     # create and show figure
#     fig = dict(data = dataPlot3, layout = layout)
#     iplot(fig)

2.4 Which Borough has the highest number of deaths by collisions?

In [155]:
# these two lines are what allow code to show up in a notebook!
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()

dataPlot = [go.Histogram(histfunc = "sum",x=names,y=y)]

# the layout of figure
layout = dict(title = "Number of Deaths Per Borough",
              xaxis= dict(title= 'Borough',ticklen= 2,zeroline= False))

# create and show figure
fig = dict(data = dataPlot, layout = layout)
iplot(fig)

Let's Focus On Brooklyn, Since it has the largest number of deaths in New York

2.5 What Are The Top 5 Causes of Collisions In Brooklyn?

In [156]:
#Number of Accidents Pe Day in Queens
# getAccidentsAndPlot("QUEENS")
In [62]:
all_dataC.head(3)
Out[62]:
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/18/2020 12:00 BROOKLYN 11214 40.601010 -73.99352 (40.60101, -73.99352) NaN NaN 15 BAY 31 STREET ... Unspecified NaN NaN NaN 4342387 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle NaN NaN NaN
1 09/25/2020 0:01 NaN NaN NaN NaN NaN TRIBOROUGH BRIDGE NaN NaN ... NaN NaN NaN NaN 4351600 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
2 09/05/2020 19:25 STATEN ISLAND 10305 40.604397 -74.06900 (40.604397, -74.069) FINGERBOARD ROAD NARROWS ROAD NORTH NaN ... Unspecified Unspecified NaN NaN 4346201 Sedan Sedan Sedan NaN NaN

3 rows × 29 columns

In [157]:
all_dataC.isnull().any()
Out[157]:
year                         False
BOROUGH                       True
collision_factor             False
COLLISION_ID                 False
NUMBER OF PERSONS INJURED     True
dtype: bool

Data Cleaning

In [76]:
#dtypes
all_dataC['CRASH TIME'] = pd.to_datetime(all_dataC['CRASH TIME'])
all_dataC['ZIP CODE'] = pd.to_numeric(all_dataC['ZIP CODE'], errors='coerce')
#rename
all_dataC = all_dataC.rename(columns={'CONTRIBUTING FACTOR VEHICLE 1': 'collision_factor'})
#add count column
all_dataC['count']=1
#extract datetime data to create 3 additional columns
all_dataC['year'] = all_dataC['CRASH TIME'].dt.year
#not null or unspecified
all_dataC = all_dataC[pd.notnull(all_dataC['collision_factor'])]
all_dataC = all_dataC.drop(all_dataC[all_dataC['collision_factor']=='Unspecified'].index)
all_dataC['year'] = all_dataC['CRASH TIME'].dt.year
#select columns
all_dataC = all_dataC[['year', 'BOROUGH', 'collision_factor', 'COLLISION_ID', 'NUMBER OF PERSONS INJURED']]
In [112]:
all_dataC.head()
Out[112]:
year BOROUGH collision_factor COLLISION_ID NUMBER OF PERSONS INJURED
0 2020 BROOKLYN Backing Unsafely 4342387 0.0
1 2020 NaN Other Vehicular 4351600 0.0
2 2020 STATEN ISLAND Driver Inattention/Distraction 4346201 0.0
3 2020 BROOKLYN Driver Inattention/Distraction 4353455 2.0
5 2020 BRONX Driver Inattention/Distraction 4353187 0.0

Brooklyn

In [133]:
#top collision factors by year, by borough
df_agg = all_dataC.groupby(['BOROUGH','year', 'collision_factor']).sum().reset_index()
# g = df_agg['count'].groupby(level=0, group_keys=False)
brooklyn = df_agg[df_agg['BOROUGH']=='BROOKLYN']

brooklyn.sort_values(by=['NUMBER OF PERSONS INJURED'], ascending=False).head()
Out[133]:
BOROUGH year collision_factor COLLISION_ID NUMBER OF PERSONS INJURED
70 BROOKLYN 2020 Driver Inattention/Distraction 196758140966 18539.0
77 BROOKLYN 2020 Failure to Yield Right-of-Way 73531192039 11732.0
111 BROOKLYN 2020 Traffic Control Disregarded 18755625494 4047.0
80 BROOKLYN 2020 Following Too Closely 44980979570 3701.0
78 BROOKLYN 2020 Fatigued/Drowsy 13791476136 2079.0

Queens

In [137]:
#top collision factors by borough
#top collision factors by year, by borough
df_agg = all_dataC.groupby(['BOROUGH','year', 'collision_factor']).sum().reset_index()
# g = df_agg['count'].groupby(level=0, group_keys=False)
brooklyn = df_agg[df_agg['BOROUGH']=='']

brooklyn.sort_values(by=['NUMBER OF PERSONS INJURED'], ascending=False).head()
Out[137]:
BOROUGH year collision_factor COLLISION_ID NUMBER OF PERSONS INJURED
190 QUEENS 2020 Driver Inattention/Distraction 215701410831 18774.0
197 QUEENS 2020 Failure to Yield Right-of-Way 83004729961 11779.0
231 QUEENS 2020 Traffic Control Disregarded 18909454028 3771.0
200 QUEENS 2020 Following Too Closely 41587980115 3230.0
183 QUEENS 2020 Alcohol Involvement 10771525761 1555.0

Manhattan

In [138]:
#top collision factors by borough
#top collision factors by year, by borough
df_agg = all_dataC.groupby(['BOROUGH','year', 'collision_factor']).sum().reset_index()
# g = df_agg['count'].groupby(level=0, group_keys=False)
brooklyn = df_agg[df_agg['BOROUGH']=='MANHATTAN']

brooklyn.sort_values(by=['NUMBER OF PERSONS INJURED'], ascending=False).head()
Out[138]:
BOROUGH year collision_factor COLLISION_ID NUMBER OF PERSONS INJURED
130 MANHATTAN 2020 Driver Inattention/Distraction 170553523923 11176.0
137 MANHATTAN 2020 Failure to Yield Right-of-Way 34731568263 4742.0
151 MANHATTAN 2020 Other Vehicular 42141153322 1992.0
140 MANHATTAN 2020 Following Too Closely 34158055847 1687.0
171 MANHATTAN 2020 Traffic Control Disregarded 8187444816 1534.0

2.6 What is the weekly collisions spread over every hour?

In [37]:
collisionsDF.head(2)
Out[37]:
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 08/18/2020 12:00 BROOKLYN 11214 40.60101 -73.99352 (40.60101, -73.99352) NaN NaN 15 BAY 31 STREET ... Unspecified NaN NaN NaN 4342387 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle NaN NaN NaN
1 09/25/2020 0:01 NaN NaN NaN NaN NaN TRIBOROUGH BRIDGE NaN NaN ... NaN NaN NaN NaN 4351600 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN

2 rows × 29 columns

In [38]:
# Assuming that dataframe has DATE and Time columns sorted (meaning row 0 is latest date and last row represents the earliest date)
# It can be suppossed based on the tail() and head() results
oldest_date = collisionsDF.iloc[-1]['CRASH DATE'] #tail
newest_date = collisionsDF.iloc[0]['CRASH DATE'] #head

newest_date
Out[38]:
'08/18/2020'
In [39]:
#Extract DateTime features
collisionsDF['collisions_datetime'] = pd.to_datetime(collisionsDF['CRASH DATE'], infer_datetime_format = True)   #infer_datetime_format=True) 
collisionsDF['collisions_time'] = pd.to_datetime(collisionsDF['CRASH TIME'], infer_datetime_format = True) 

#collisionsDF.loc[:, 'collisions_datetime'] = collisionsDF['collisions_datetime'].dt.date

collisionsDF.loc[:, 'collisions_weekday'] = collisionsDF['collisions_datetime'].dt.weekday
collisionsDF.loc[:, 'collisions_weekofyear'] = collisionsDF['collisions_datetime'].dt.weekofyear
collisionsDF.loc[:, 'collisions_hour'] = collisionsDF['collisions_time'].dt.hour
collisionsDF.loc[:, 'collisions_minute'] = collisionsDF['collisions_time'].dt.minute
collisionsDF.loc[:, 'collisions_dt'] = (collisionsDF['collisions_datetime'] - collisionsDF['collisions_datetime'].min()).dt.total_seconds()

#collisionsDF['collisions_time'].dtype
collisionsDF.head()
Out[39]:
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5 collisions_datetime collisions_time collisions_weekday collisions_weekofyear collisions_hour collisions_minute collisions_dt
0 08/18/2020 12:00 BROOKLYN 11214 40.601010 -73.99352 (40.60101, -73.99352) NaN NaN 15 BAY 31 STREET ... NaN NaN NaN 2020-08-18 2020-11-30 12:00:00 1 34 12 0 256608000.0
1 09/25/2020 0:01 NaN NaN NaN NaN NaN TRIBOROUGH BRIDGE NaN NaN ... NaN NaN NaN 2020-09-25 2020-11-30 00:01:00 4 39 0 1 259891200.0
2 09/05/2020 19:25 STATEN ISLAND 10305 40.604397 -74.06900 (40.604397, -74.069) FINGERBOARD ROAD NARROWS ROAD NORTH NaN ... Sedan NaN NaN 2020-09-05 2020-11-30 19:25:00 5 36 19 25 258163200.0
3 09/30/2020 14:00 BROOKLYN 11235 40.586376 -73.94790 (40.586376, -73.9479) VOORHIES AVENUE EAST 21 STREET NaN ... NaN NaN NaN 2020-09-30 2020-11-30 14:00:00 2 40 14 0 260323200.0
4 08/19/2020 12:36 BRONX 10472 40.830738 -73.86707 (40.830738, -73.86707) NaN NaN 1231 SAINT LAWRENCE AVENUE ... NaN NaN NaN 2020-08-19 2020-11-30 12:36:00 2 34 12 36 256694400.0

5 rows × 36 columns

In [41]:
#Data Preparation for the weekly analysis
#get count of collisions every hour on every weekday.
sunday = collisionsDF[collisionsDF['collisions_weekday'] == 6]
df_sundayhourlytripcount = sunday.groupby('collisions_hour').count()
monday = collisionsDF[collisionsDF['collisions_weekday'] == 0]
df_mondayhourlytripcount = monday.groupby('collisions_hour').count()
tuesday = collisionsDF[collisionsDF['collisions_weekday'] == 1]
df_tuesdayhourlytripcount = tuesday.groupby('collisions_hour').count()
wednesday = collisionsDF[collisionsDF['collisions_weekday'] == 2]
df_wednesdayhourlytripcount = wednesday.groupby('collisions_hour').count()
thursday =  collisionsDF[ collisionsDF['collisions_weekday'] == 3]
df_thursdayhourlytripcount = thursday.groupby('collisions_hour').count()
friday = collisionsDF[collisionsDF['collisions_weekday'] == 4]
df_fridayhourlytripcount = friday.groupby('collisions_hour').count()
saturday = collisionsDF[collisionsDF['collisions_weekday'] == 5]
df_saturdayhourlytripcount = saturday.groupby('collisions_hour').count()
collisions_hr_x = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
sun_crashcounty = df_sundayhourlytripcount['CRASH TIME']
mon_crashcounty = df_mondayhourlytripcount['CRASH TIME']
tues_crashcounty = df_tuesdayhourlytripcount['CRASH TIME']
wed_crashcounty = df_wednesdayhourlytripcount['CRASH TIME']
thurs_crashcounty = df_thursdayhourlytripcount['CRASH TIME']
fri_crashcounty = df_fridayhourlytripcount['CRASH TIME']
sat_crashcounty = df_saturdayhourlytripcount['CRASH TIME']

df_sundayhourlytripcount.tail(1)
sun_crashcounty.tail(1)

# Create traces
trace1 = go.Scatter(x = collisions_hr_x, y = sun_crashcounty,  mode = "lines+markers", name = 'Sunday')
trace2 = go.Scatter(x = collisions_hr_x, y = mon_crashcounty, name = 'Monday')
trace3 = go.Scatter( x = collisions_hr_x, y = tues_crashcounty, name = 'Tuesday')
trace4 = go.Scatter( x = collisions_hr_x, y = wed_crashcounty, name = 'Wednesday')
trace5 = go.Scatter( x = collisions_hr_x, y = thurs_crashcounty,  name = 'Thursday')
trace6 = go.Scatter( x = collisions_hr_x, y = fri_crashcounty,   name = 'Friday')
trace7 = go.Scatter(x =collisions_hr_x, y = sat_crashcounty,  mode = "lines+markers", name = 'Saturday')
layout = dict(title = 'Weekly Collisions by Hour', 
              xaxis= dict(title= 'Hour of the day',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Collisions quantity',ticklen= 5,zeroline= False)
             )
linedata = [trace1, trace2, trace3, trace4, trace5, trace6,trace7]
fig = dict(data=linedata, layout=layout)
py.iplot(fig, filename='timeline-lineplot')

2.7 What is the collisions frequency every hour of the week?

In [65]:
#Preparing data for the graph
W_0 = collisionsDF[collisionsDF['collisions_weekday'] == 0].groupby('collisions_hour').count()
W_1 = collisionsDF[collisionsDF['collisions_weekday'] == 1].groupby('collisions_hour').count()
W_2 = collisionsDF[collisionsDF['collisions_weekday'] == 2].groupby('collisions_hour').count()
W_3 = collisionsDF[collisionsDF['collisions_weekday'] == 3].groupby('collisions_hour').count()
W_4 = collisionsDF[collisionsDF['collisions_weekday'] == 4].groupby('collisions_hour').count()
W_5 = collisionsDF[collisionsDF['collisions_weekday'] == 5].groupby('collisions_hour').count()
W_6 = collisionsDF[collisionsDF['collisions_weekday'] == 6].groupby('collisions_hour').count()

#W_0.head()
trace = go.Heatmap(z=[W_0['CRASH TIME'],W_1['CRASH TIME'],W_2['CRASH TIME'],W_3['CRASH TIME'],W_4['CRASH TIME'],W_5['CRASH TIME'],W_6['CRASH TIME']],
                    y=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday'],
                   x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','Noon','1pm','2pm',
                     '3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
                  colorscale='Reds',xgap = 10,ygap = 10)

layout = dict(title = 'Collisions per week per hour')
dataheat=[trace]
fig = dict(data = dataheat, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
#Takeaway: Friday seems to have rush hours around 5pm. This may be impacting durations trip around that time. Could be interesing to know what
#is the street name with most collisions in that hour.

2.8 How do collisions that cause injuries vary by street ?

In [71]:
collisionsDF.groupby('ON STREET NAME').count().sort_values(by='NUMBER OF PERSONS INJURED', ascending=False).head(10)['NUMBER OF PERSONS INJURED']
Out[71]:
ON STREET NAME
BROADWAY                            16994
ATLANTIC AVENUE                     15061
BELT PARKWAY                        12895
3 AVENUE                            12285
NORTHERN BOULEVARD                  11831
FLATBUSH AVENUE                      9608
LONG ISLAND EXPRESSWAY               9558
BROOKLYN QUEENS EXPRESSWAY           9429
LINDEN BOULEVARD                     9376
QUEENS BOULEVARD                     9284
Name: NUMBER OF PERSONS INJURED, dtype: int64
In [72]:
TOP5 = collisionsDF.groupby('ON STREET NAME').count().sort_values(by='NUMBER OF PERSONS INJURED', ascending=False).head(5)['NUMBER OF PERSONS INJURED']
TOP5.keys()
Out[72]:
Index(['BROADWAY                        ', 'ATLANTIC AVENUE                 ',
       'BELT PARKWAY                    ', '3 AVENUE                        ',
       'NORTHERN BOULEVARD              '],
      dtype='object', name='ON STREET NAME')
In [87]:
TOP5.keys()[4]
Out[87]:
'NORTHERN BOULEVARD              '
In [77]:
#Lets look at demand per hour.
bar0data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[0]].groupby(['collisions_hour']).count().reset_index()
bar1data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[1]].groupby(['collisions_hour']).count().reset_index()
bar2data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[2]].groupby(['collisions_hour']).count().reset_index()
bar3data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[3]].groupby(['collisions_hour']).count().reset_index()
bar4data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[4]].groupby(['collisions_hour']).count().reset_index()
In [97]:
trace0 = go.Bar(
    x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','Noon','1pm','2pm',
                     '3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
    y= bar0data['NUMBER OF PERSONS INJURED'],
    name='BROADWAY',
    marker=dict(color='rgb(49,130,189)' ))
trace1 = go.Bar(
    x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','Noon','1pm','2pm',
                     '3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
    y= bar1data['NUMBER OF PERSONS INJURED'],
    name='ATLANTIC AVENUE',
    marker=dict(color='rgb(255,192,203)',))

trace2 = go.Bar(
    x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','Noon','1pm','2pm',
                     '3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
    y= bar2data['NUMBER OF PERSONS INJURED'],
    name='BELT PARKWAY',
    marker=dict(color='rgb(138,43,226)',))
trace3 = go.Bar(
    x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','Noon','1pm','2pm',
                     '3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
    y= bar3data['NUMBER OF PERSONS INJURED'],
    name='3 Avenue',
    marker=dict(color='rgb(255,215,0)',))

trace4 = go.Bar(
    x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','Noon','1pm','2pm',
                     '3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
    y= bar4data['NUMBER OF PERSONS INJURED'],
    name='NORTHERN BOULEVARD',
    marker=dict(color='rgb(255,0,0)',)) 

datax = [trace0, trace1, trace2, trace3, trace4]
layout = go.Layout(xaxis=dict(tickangle=-45),barmode='group')

fig = go.Figure(data=datax, layout=layout)
py.iplot(fig)

The rush hour at Broadway, Atlantic avenue and 3 avenue is 4 pm while at Northern Boulevard is 5pm. However, between 4pm and 6pm is the range of time with the most hight risk of collision.

Conclusions

Insights

Deaths Per Borough

Brooklyn is Deadly

  • BRONX 171785
  • BROOKLYN 376155
  • MANHATTAN 281854
  • QUEENS 321876
  • STATEN ISLAND 51026

Top Causes of Deaths

  • Driver Inattention/Distraction => True in Brooklyn and NYC Overall Y/Y
  • Failure to Yield Right-of-Way
  • Traffic Control Disregarded (Eat a snickers)

Highest Hours of Death

  • 8-10
  • 3-7

Streets that cause the most casualties

  • Broadway
  • Atlantic Avenue
  • Belt Parkway

Proposal

Since driver inattention and distraction is the highest collision factor Y/Y and most accidents occur during workweeks and business hours, I propose a data driven based public awareness campaign focused on reducing driver distractions.

Real World Initiatives

In April 9th, 2019 the Govenor's Trafic Saftey Committee (GTSC) launched a Distracted Driving PSA Campaign in NYC. The campaign featured on billboards, television, and radio.

My Reccommendation

  • I reccommend airing the campaign during peak commute hours and during holidays during COIVID 19.
  • Using social media in the campaign strategy. We can start by using zip codes with highest death rates
  • Geospatial social media analysis
  • Eat a snickers after work
  • I believe this campaign will have an impact on the number of collisions and that it will see a reduction in the number of injuries from reckless driving.

Future Improvements

  • Implement machine learning: time series analysis
  • 3D Visualizations
  • Heatmap
  • Confidence Intervals
  • Hypothesis Testing